Estimating Reference Scopes of Wikipedia Article Inner-links

Authors

  • Renzhi Wang
  • Mizuho Iwaihara
Abstract

Wikipedia is the largest online encyclopedia and is widely used as a machine-processable knowledge and semantic resource. A link within Wikipedia indicates that two articles, or parts of them, are topically related. Existing link detection methods focus on article titles because most links in Wikipedia point to article titles. However, a number of links point to specific segments of the target article, because the whole article is too general for readers to grasp the intention of the link. We propose a method that automatically predicts whether the target of a link is a specific segment and, if so, identifies which segment is most relevant. The method combines Latent Dirichlet Allocation (LDA) and Maximum Likelihood Estimation (MLE) to represent every segment as a vector, from which we obtain the similarity of each segment pair. We then use variance, standard deviation, and other statistical features to predict the result. We also use a Word2Vector model to embed all segments into a semantic space and calculate cosine similarities between segment pairs, and then train a Random Forest classifier to predict link scopes. In evaluations on Wikipedia articles, our method achieved reasonable results.

Keywords: Wikipedia, link suggestion, LDA, word2vector, PMI
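The similarity-and-statistics step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each segment has already been turned into a numeric vector (by a topic model or an embedding model), and it compares one source segment against every segment of the target article. The function names, the toy vectors, and the exact feature set are illustrative assumptions, not the authors' implementation.

```python
import math
from statistics import pvariance, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def scope_features(source_vec, target_segment_vecs):
    """Similarities of one source segment to each target segment, plus summary statistics."""
    sims = [cosine(source_vec, seg) for seg in target_segment_vecs]
    return {
        "similarities": sims,
        "variance": pvariance(sims),   # population variance of the similarities
        "std_dev": pstdev(sims),       # population standard deviation
        "max": max(sims),
        "best_segment": sims.index(max(sims)),  # index of the most similar segment
    }


# Toy vectors standing in for segment representations (made-up numbers).
source = [1.0, 0.0, 0.0]
segments = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
features = scope_features(source, segments)
```

One plausible reading is that a large spread among the similarities means a single segment stands out, which would hint that the link refers to that segment rather than to the whole article. Features like these could then be passed to a classifier such as a Random Forest, as the abstract describes.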

Similar resources

Sense and Reference Disambiguation in Wikipedia

Wikipedia articles are annotated by volunteer contributors with numerous links that connect words and phrases to relevant titles in Wikipedia. In this paper, we identify inconsistencies in the user annotation of links and show that they can have a substantial impact on the performance of word sense disambiguation systems that are trained on Wikipedia links. We describe two major types of link a...

Finding titles representing segments of Wikipedia Articles from keyphrases

Wikipedia is a free online encyclopedia that aims to allow anyone to edit any article or create them. However, articles tend to become long and complex, so giving appropriate titles or key phrases to untitled segments is necessary for reader assistance. In this paper, we show methods to select titles for representing article segments. Key phrase extraction has been studied for years, but we con...

Boosting Cross-Lingual Knowledge Linking via Concept Annotation

Automatically discovering cross-lingual links (CLs) between wikis can largely enrich the cross-lingual knowledge and facilitate knowledge sharing across different languages. In most existing approaches for cross-lingual knowledge linking, the seed CLs and the inner link structures are two important factors for finding new CLs. When there are insufficient seed CLs and inner links, discovering ne...

Wikiwhere: An interactive tool for studying the geographical provenance of Wikipedia references

Wikipedia articles about the same topic in different language editions are built around different sources of information. For example, one can find very different news articles linked as references in the English Wikipedia article titled “Annexation of Crimea by the Russian Federation” than in its German counterpart (determined via Wikipedia’s language links). Some of this difference can of cou...

The Task of Automatic Documents Clustering

In this paper we describe a new unsupervised algorithm for automatic documents clustering with the aid of Wikipedia. Contrary to other related algorithms in the field, our algorithm utilizes only two aspects of Wikipedia, namely its categories network and articles titles. We do not utilize the inner content of the articles in Wikipedia or their inner or inter links. The implemented algorithm wa...

Publication date: 2017